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Abstract: Variable selection and dimension reduction are two commonly adopted approaches for 
high-dimensional data analysis, but have traditionally been treated separately. Here we propose an 
integrated approach, called sparse gradient learning (SGL), for variable selection and dimension reduc- 
tion via learning the gradients of the prediction function directly from samples. By imposing a sparsity 
constraint on the gradients, variable selection is achieved by selecting variables corresponding to non- 
zero partial derivatives, and effective dimensions are extracted based on the eigenvectors of the derived 
sparse empirical gradient covariance matrix. An error analysis is given for the convergence of the es- 
timated gradients to the true ones in both the Euclidean and the manifold setting. We also develop 
an efficient forward-backward splitting algorithm to solve the SGL problem, making the framework 
practically scalable for medium or large datasets. The utility of SGL for variable selection and feature 
extraction is explicitly given and illustrated on artificial data as well as real- world examples. The main 
advantages of our method include variable selection for both linear and nonlinear predictions, effective 
dimension reduction with sparse loadings, and an efficient algorithm for large p, small n problems. 
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1. Introduction 

Datasets with many variables have become increasingly common in biological and physical sciences. In 
biology, it is nowadays a common practice to measure the expression values of tens of thousands of genes, 
genotypes of millions of SNPs, or epigenetic modifications at tens of millions of DNA sites in one single 
experiment. Variable selection and dimension reduction are increasingly viewed as a necessary step in dealing 
with these high-dimensional data. 

Variable selection aims at selecting a subset of variables most relevant for predicting responses. Many 
algorithms have been proposed for variable selection [1]. They typically fall into two categories: Feature 
Ranking and Subset Selection. Feature Ranking scores each variable according to a metric, derived from 
various correlation or information theoretic criteria [1-3], and eliminates variables below a threshold score. 
Because Feature Ranking methods select variables based on individual prediction power, they are ineffective 
in selecting a subset of variables that are marginally weak but in combination strong in prediction. Subset 
Selection aims to overcome this drawback by considering and evaluating the prediction power of a subset 
of variables as a group. One popular approach to subset selection is based on direct object optimization, 
which formalizes an objective function of variable selection and selects variables by solving an optimization 
problem. The objective function often consists of two terms: a data fitting term accounting for prediction 
accuracy, and a regularization term controlling the number of selected variables. LASSO proposed by [4] 
and elastic net by [5] are two examples of this type of approach. The two methods are widely used because 
of their implementation efficiency [5, 6] and the ability of performing simultaneous variable selection and 
prediction, however, a linear prediction model is assumed by both methods. The component smoothing and 
selection operator (COSSO) proposed in [7] try to overcome this shortcoming by using a functional LASSO 
penalty. However, COSSO is based on the framework of smoothing spline ANOVA which makes it impossible 
to deal with high dimensional data. 

Dimension reduction is another commonly adopted approach in dealing with high-dimensional data. Root- 
ing in dimension reduction is the common belief that many real- world high-dimensional data are concentrated 
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on a low-dimensional manifold embedded in the underlying Euclidean space. Therefore mapping the high- 
dimensional data into the low-dimensional manifold should be able to improve prediction accuracy, to help 
visualize the data, and to construct better statistical models. A number of dimension reduction methods have 
been proposed, ranging from principle component analysis to manifold learning for non-linear settings [8-13]. 
However, most of these dimension reduction methods are unsupervised, and therefore are likely sub-optimal 
with respect to predicting responses. In supervised settings, most recent work focuses on finding a subspace 
S such that the projection of the high dimensional data x onto S captures the statistical dependency of the 
response y on x. The space S is called effective dimension reduction (EDR) space [14]. 

Several methods have been proposed to identify EDR space. The research goes back to sliced inverse 
regression (SIR) proposed by Li [15], where the covariance matrix of the inverse regression is explored for 
dimension reduction. The main idea is that if the conditional distribution p(y\x) concentrates on a subspace S, 
then the inverse regression E(x\y) should lie in that same subspace. However, SIR imposes specific modeling 
assumptions on the conditional distribution p{y\x) or the regression E(y\x). These assumptions hold in 
particular if the distribution of x is elliptic. In practice, however, we do not necessarily expect that x will 
follow an elliptic distribution, nor is it easy to assess departures from cllipticity in a high-dimensional setting. 
A further limitation of SIR is that it yields only a one-dimensional subspace for binary classifications. Other 
reverse regression based methods, including principal Hessian directions (pHd [16]), sliced average variance 
estimation (SAVE [17]) and contour regression [18], have been proposed, but they have similar limitations. 
To address these limitations, Xia et al. [14] proposed a method called the (conditional) minimum average 
variance estimation (MAVE) to estimate the EDR directions. The assumption underlying MAVE is quite 
weak and only a scmiparametric model is used. Under the semiparametric model, conditional covariance is 
estimated by linear smoothing and EDR directions are then estimated by minimizing the derived conditional 
covariance estimation. In addition, a simple outer product gradient (OPG) estimator is proposed as an initial 
estimator. Other related approaches include methods that estimate the derivative of the regression function 
[19, 20]. Recently, Fukumizu et al. [21] proposed a new methodology which derives EDR directly from a 
formulation of EDR in terms of the conditional independence of x from the response y, given the projection 
of x on the EDR space. The resulting estimator is shown to be consistent under weak conditions. However, 
all these EDR methods can not be directly applied to the large p, small n case, where p is the dimension 
of the underlying Euclidean space in which the data lies, and n is the number of samples. To deal with the 
large p, small n case, Mukherjee and co-workers [22, 23] introduced a gradient learning method (which will 
be referred to as GL) for estimating EDR by introducing a Tikhonov regularization term on the gradient 
functions. The EDR directions were estimated using the eigenvectors of the empirical gradient covariance 
matrix. 

Although both variable selection and dimension reduction offer valuable tools for statistical inference 
in high-dimensional space and have been prominently researched, few methods are available for combining 
them into a single framework where variable selection and dimensional reduction can be done. One notable 
exception is the sparse principle component analysis (SPCA), which produces modified principle components 
with sparse loadings [9], However, SPCA is mainly used for unsupervised linear dimension reduction, our 
focus here is the variable selection and dimension reduction in supervised and potentially nonlinear settings. 
To motivate the reason why a combined approach might be interesting in a supervised setting, consider a 
microarray gene expression data measured in both normal and tumor samples. Out of 20, 000 genes measured 
in microarray, only a small number of genes (e.g. oncogenes) are likely responsible for gene expression changes 
in tumor cells. Variable selection chooses more relevant genes and dimension reduction further extracts 
features based on the subset of selected genes. Taking a combined approach could potentially improve 
prediction accuracy by removing irrelevant noisy variables. Additionally, by focusing on a small number of 
most relevant genes and extracting features among them, it could also provide a more interprctable and 
manageable model regarding genes and biological pathways involved in the carcinogenesis. 

In this article, we extend the gradient learning framework introduced by Mukherjee and co-workers [22, 23], 
and propose a sparse gradient learning approach (SGL) for integrated variable selection and dimension 
reduction in a supervised setting. The method adopts a direct object optimization approach to learn the 
gradient of the underlying prediction function with respect to variables, and imposes a regularization term to 
control the sparsity of the gradient. The gradient of the prediction function provides a natural interpretation 
of the geometric structure of the data [22-25] . If a variable is irrelevant to the prediction function, the partial 
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derivative with respect to that variable is zero. Moreover, for non-zeros partial derivatives, the larger the 
norm of the partial derivative with respect to a variable is, the more important the corresponding variable 
is likely to be for prediction. Thus the norms of partial derivatives give us a criterion for the importance of 
each variable and can be used for variable selection. Motivated by LASSO, we encourage the sparsity of the 
gradient by adding a I 1 norm based regularization term to the objective vector function. Variable selection is 
automatically achieved by selecting variables with non-zero partial derivatives. The sparse empirical gradient 
covariance matrix (S-EGCM) constructed based on the learned sparse gradient reflects the variance of the 
data conditioned on the response variable. The eigenvectors of S-EGCM are then used to construct the EDR 
directions. A major innovation of our approach is that the variable selection and dimension reduction are 
achieved within a single framework. The features constructed by the eigenvectors of S-EGCM are sparse 
with non-zero entries corresponding only to selected variables. 

The rest of this paper is organized as follows. In section 2, we describe the sparse gradient learning 
algorithm for regression, where an automatic variable selection scheme is integrated. The derived sparse 
gradient is an approximation of the true gradient of regression function under certain conditions, which we 
give in subsection 2.3 and their proofs are delayed in Section 3. We describe variable selection and feature 
construction using the learned sparse gradients in subsection 2.4. As our proposed algorithm is an infinite 
dimensional minimization problem, it can not be solved directly. We provide an efficient implementation 
for solving it in section 4. In subsection 4.1, we give a representer theorem, which transfer the infinite 
dimensional sparse gradient learning problem to a finite dimensional one. In subsection 4.3, we solve the 
transferred finite dimensional minimization problem by a forward-backward splitting algorithm. In section 5, 
we generalize the sparse gradient learning algorithm to a classification setting. We illustrate the effectiveness 
of our gradient-based variable selection and feature extraction approach in section 6 using both simulated 
and real- world examples. 



2. Sparse gradient learning for regression 
2.1. Basic definitions 

Let y and x be respectively R-valued and Revalued random variables. The problem of regression is to 
estimate the regression function f p (x) = E(y|x) from a set of observations Z := {(x.;, yi)}™ =1 , where Xj := 
(xj, . . . , x\ ) T £ M p is an input, and yi £ K is the corresponding output. 

We assume the data are drawn i.i.d. from a joint distribution p(x, y), and the response variable y depends 
only on a few directions in K. p as follows 

y = / p (x) + e = g(b[x, fc^x) + e, (1) 

where e is the noise, B = (bi, . . . , b r ) is a p x r orthogonal matrix with r < p, and -E'(elx) = almost surely. 
We call the r dimensional subspace spanned by {bj} r i=1 the effective dimension reduction (EDR) space [14]. 
For high-dimensional data, we further assume that B is a sparse matrix with many rows being zero vectors, 
i.e. the regression function depends only on a subset of variables in x. 

Suppose the regression function / p (x) is smooth. The gradient of f p with respect to variables is 

A quantity of particular interest is the gradient outer product matrix G = (Gy), apxp matrix with elements 

Gij - ("A f&\ , (3) 



" ' \ <).r' ' dxi / 



where px is the marginal distribution of x. As pointed out by Li [15] and Xia et al. [14], under the assumption 
of the model in Eq. (1), the gradient outer product matrix G is at most of rank r, and the EDR spaces are 
spanned by the eigenvectors corresponding to non-zero eigenvalues of G. This observation has motivated 
the development of gradient-based methods for inferring the EDR directions [14, 22, 23], and also forms the 
basis of our approach. 
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2.2. Regularization framework for sparse gradient learning 

The optimization framework for sparse gradient learning includes a data fitting term and a regularization 
term. We first describe the data fitting term. Given a set of observations Z, a commonly used data fitting 
term for regression is the mean square error — X^Lj^J/i — / p (x,;)) 2 . However, because our primary goal 
is to estimate the gradient of / p , we use the first order Taylor expansion to approximate f p by f p (x) ss 
/p(x )+V/p(x )- (x-x ). When xj is close to x*, f p {xj) « 2/i + V/ p (x;) • (Xj -Xj). Define / := (/ x , . . . , 
where = df p /dx J for j = 1, . . . ,p. The mean square error used in our algorithm is 

1 ™ 

^ (/) = ^ E <i (w - % + /to ■ - x *)) 2 ( 4 ) 

considering Taylor expansion between all pairs of observations. Here uf j is a weight function that ensures the 
locality of the approximation, i.e. ui?^ — > when ||xj — Xj|| is large. We can use, for example, the Gaussian 

with standard deviation s as a weight function. Let uj s (x) = cxp{— ^*J> }. Then the weights arc given by 

<j = ^ ( x j ~Xi) = cxp | - ^ Xj 2g2 X '^ | , (5) 

for all i,j = 1, ■ • ■ , n, with parameter s controlling the bandwidth of the weight function. In this paper, we 
view s as a parameter and is fixed in implementing our algorithm, although it is possible to tunc s using a 
greedy algorithm as RODEO in [26]. 

At first glance, this data fitting term might not appear very meaningful for high-dimensional data as 
samples are typically distributed sparsely on a high dimensional space. However, the term can also be 
explained in the manifold setting [25], in which case the approximation is well defined as long as the data 
lying in the low dimensional manifold are relative dense. More specifically, assume X is a d-dimensional 
connected compact C°° submanifold of M. p which is isometrically embedded. In particular, we know that 
X is a metric space with the metric dx and the inclusion map $ : (X,dx) i-> (R p , || • IU) is well defined 
and continuous (actually it is C°°). Note that the empirical data {xi}™ =1 are given in the Euclidean space 
MP which are images of the points {qi}" =1 C X under $ : x; = ^(qi). Then this data fitting term (4) 
can be explained in the manifold setting. From the first order Taylor expansion, when q^ and q^ arc close 
enough, we can expect that y 3 ~ Ui + (Vx/pCli)) v ij)<m where € T^X is the tangent vector such that 
q 3 = exp q . (vij). However, Vij is not easy to compute, we would like to represent the term (^xfp(<li),Vij)qi 
in the Euclidean space M. p . Suppose x = $(q) and £ = $(exp q (u)) for qel and v £ T^X. Since $ is an 
isometric embedding, i.e. d<I>q : T^X h-> T x R p = R p is an isometry for every q€l, the following holds 

(Vjc/(q),«>q = (d$ q (Vx/(q)),d$ q («)V, 

where d$ q (u) sa 0(exp q (u)) — </>(q) = £ — x for v w 0. Applying these relations to the observations Z = 
{(xj,yi)}f =1 and denote / = d$(Vxf) yields 

£z & = E <i (W - % + • - x »)) 2 ■ ( 6 ) 

This is exactly the same as the one in the Euclidean setting. 

Now we turn to the regularization term on V/ p . As discussed above, we impose a sparsity constraint on 
the gradient vector /. The motivation for the sparse constraint is based on the following two considerations: 
1) Since most variables are assumed to be irrelevant for prediction, we expect the partial derivatives of f p 
with respect to these variables should be zero; and 2) If variable x J is important for prediction, we expect 
the function f p should show significant variation along x 5 , and as such the norm of ^f- should be large. 

Thus we will impose the sparsity constraint on the vector (||§jt||, ■ • ■ , ||§jf ||) T £ K p , where || ■ || is a function 
norm, to regularize the number of non-zeros entries in the vector. 
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In this work, we specify the function norm || • || to be || • \\/c, the norm in reproducing kernel Hilbert space 
(RKHS) associated with a Mercer kernel JC(-, •) (see [27] and Section 4.1). The sparsity constraint on the 
gradient norm vector implies that the £q norm of the vector (H/^Iac, ■ • ■ , II/ p I!ac) T should be small. However, 
because the £q norm is difficult to work with during optimization, we instead use the l\ norm of the vector 
[28-30] as our regularization term 

«(/"):= A£ll/ 3 'lk, (7) 

where A is a sparsity regularization parameter. This functional LASSO penalty has been used in [7] as COSSO 
penalty. However, our component here is quite different from theirs, which makes our algorithm useful for 
high dimensional problems. 

The norm || • is widely used in statistical inference and machine learning (sec [31]). It can ensure each 
approximated partial derivative f 3 € Hyc, which in turn imposes some regularity on each partial derivative. 
It is possible to replace the hypothesis space H^- for the vector / in (7) by some other space of vector-valued 
functions [32] in order to learn the gradients. 

Combining the data fidelity term (4) and the regularization term (7), we propose the following optimization 
framework, which will be referred as sparse gradient learning, to learn V/ p 

^ n p 

fz := arg min — V uj s ij (y l - Vj + f(xi) ■ (xj - Xj)) + AV (8) 

A key difference between our framework and the one in [22] is that our regularization is based on l\ 
norm, while the one in [22] is based on ridge regularization. The difference may appear minor, but makes 
a significant impact on the estimated V/ p . In particular, V/ p derived from Eq. (8) is sparse with many 
components potentially being zero functions, in contrast to the one derived from [22], which is comprised of 
all non-zero functions. The sparsity property is desirable for two primary reasons: 1) In most high-dimensional 
real-world data, the response variable is known to depend only on a subset of the variables. Imposing sparsity 
constraints can help eliminate noisy variables and thus improve the accuracy for inferring the EDR directions; 
2) The resulting gradient vector provides a way to automatically select and rank relevant variables. 

Remark 1. The OPG method introduced by Xia et al. [14] to learn EDR directions can be viewed as a 
special case of the sparse gradient learning, corresponding to the case of setting K(x, y) — 5 x ,y and A = in 
Eq. (8). Thus the sparse gradient learning can be viewed as an extension of learning gradient vectors only 
at observed points by OPG to a vector function of gradient over the entire space. Note that OPG cannot be 
directly applied to the data with p > n since the problem is then under determined. Imposing a regularization 
term as in Eq. (8) removes such a limitation. 

Remark 2. The sparse gradient learning reduces to a special case that is approximately LASSO [4] if we 
choose K(x,y) = 5 Xi y and additionally require f(Xi) to be invariant for different i (i.e. linearity assumption). 
Note that LASSO assumes the regression function is linear, which can be problematic for variable selection 
when the prediction function is nonlinear [6]. The sparse gradient learning makes no linearity assumption, 
and can thus be viewed as an extension of LASSO for variable selection with nonlinear prediction functions. 

Remark 3. A related framework is to learn the regression function directly, but impose constraints on the 
sparsity of the gradient as follows 

min W ) a + A£ II o4ll,c. (9) 

/eHe n ' t-— 1 ox 1 

i—l i—1 

This framework is however difficult to solve because the regularization term X)f=i IIJ^IIk: * s both nonsmooth 
and inseparable, and the representer theorem introduced later to solve Eq. (8) cannot be applied here. Note 
that our primary goal is to select variables and identify the EDR directions. Thus we focus on learning 
gradient functions rather than the regression function itself. 
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2.3. Error analysis 



Next we investigate the statistical performance of the sparse gradient learning with a Gaussian weight in 
Eq. (5). Assume that the data Z = {(xj, are i.i.d drawn from a joint distribution p, which can be 

divided into a marginal distribution px and a conditional distribution p{y\x). Denote f p to be the regression 
function given by 

f P {x) = J ydp(y\x). 

We show that under certain conditions, fz — > V/„ as n — > oo for suitable choices of the parameters A and 
s that go to zero as n — > oo. In order to derive the learning rate for the algorithm, some regularity conditions 
on both the marginal distribution and V/ p are required. 

Denote dX be the boundary of X and c?(x, dX)(x £ X) be the shortest Euclidean distance from x to dX, 
i.e, <i(x, dX) = mi ye g X d(x, y). 

Theorem 1. Suppose the data Z = {(xi,j/i)}™ =1 are i.i.d drawn from a joint distribution p and yi < M 
for all i for a positive constant M . Assume that for some constants c p > and < 9 < 1, the marginal 
distribution px satisfies 

px({x€X :d(x,dX)} <t) <c p t (10) 

and the density p(x) of px satisfies 

sup p(x) < c p and |p(x) — p(u)\ < c p |x — u\ e , Vu, x e X. (11) 

Let fz be the estimated gradient function given by Eq. (8) and V/ p be the true gradient of the regression 
function f p . Suppose that K <G C 2 and V/ p € H^. Choose X = X(n) = n~ p+°+ 2( > and s = sin) = n~ 2( P +2+2e> _ 
Then there exists a constant C > such that for any < J] < 1 with confidence 1 — r) 



4 / 1 \ 4 (p+ 2 + 2 «) 

\\fz-Vf P \\ L * <Clog- - . (12) 

Condition (11) means the density of the marginal distribution is Holder continuous with exponent 9. 
Condition (13) specifics behavior of px near the boundary dX of X. Both are common assumptions for error 
analysis. When the boundary dX is piccewisc smooth, Eq. (11) implies Eq. (13). Here we want to emphasize 
that our terminology sparse gradient for the derived fz comes from this approximation property. Since we 
treat each component of the gradient separately in our estimation algorithm, fz does not necessarily satisfy 
the gradient constraint Q x iQ x j = Q x iQ x i f° r au * an d j. However, we note that it is possible to add these 
constraints explicitly into the convex optimization framework that we will describe later. 

The convergence rate in Eq. (15) can be greatly improved if we assume that the data are lying in or near 
a low dimensional manifold [25, 33]. In this case, the learning rate in the exponent of 1/n depends only on 
the dimension of the manifold, not the actual dimension of the Euclidean space. 

Denote dx be the metric on X and dV be the Ricmannian volume measure of M. Let dX be the 
boundary of X and dx(x, dX)(x G X) be the shortest distance from x to dX on the manifold X. Denote 
(d<i>)* is the dual of c?4> q and (g?<I>)* maps a p-dimensional vector valued function / to a vector field with 

(d*)*/(q) = (^)*(/(q)) [34]. 

Theorem 2. Let X be a connected compact C°° submanifold of MP which is isometrically embedded and of 
dimension d. Suppose the data Z = {(x^, yi)}f = i are i.i.d drawn from a joint distribution p defined on X xY 
and there exists a positive constant M such that yi < M for all i. Assume that for some constants c p > 
and < 9 < 1, the marginal distribution px satisfies 

px({x£l :d x {x,dX)} <t)<c p t (13) 

and the density p(x) = dp ^y exists and satisfies 

supp(x)<c p and |p(x) — p(u)| < c p dx(x, u) e , Vu,x6l. (14) 
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Let fz be the estimated gradient function given by Eq. (8) and V xfp be the true gradient of the regression 

function f p . Suppose that JC G C 2 (X x X), f p G C 2 (X) and d<S>{V x f P ) G H^. C?/ioose A = A(n) = rTwfcar 
i 

and s = s(n) = n 2 ( d + 2 + 2e ) . TTien t/iere exists a constant C > shc/i f/iaf /or any < r/ < 1 with confidence 

1 — 7? 

g 

||(d$r/z-Vx/plU= <Clog- - . (15) 



Note that the convergence rate in Theorem 2 is exactly the same as the one in Theorem 1 except that we 
replaced the Euclidean dimension p by the intrinsic dimension d. 

The constraints V x f P G in Theorem 1 and #(V X / P ) G are somewhat restrictive, and extension 
to mild conditions is possible [25]. Here we confine oursclf to these conditions in order to avoid introducing 
more notations and conceptions. 

The proof of Theorem 1 and Theorem 2 are somewhat complicated and will be given in the Section 3. 
The main idea behind the proof is to simultaneously control the sample error and the approximation error; 
see section 3 for details. 



2-4- Variable selection and effective dimension reduction 

Next we describe how to do variable selection and extract EDR directions based on the learned gradient 

fz = (/£,.. -,/ff. 

As discussed above, because of the l\ norm used in the regularization term, we expect many of the entries 
in the gradient vector fz be zero functions. Thus, a natural way to select variables is to identify those entries 
with non-zeros functions. More specifically, we select variables based on the following criterion. 

Definition 1. Variable selection via sparse gradient learning is to select variables in the set 

S := {j : H/ilk + 0, j = l,...,p} (16) 

where fz = (fz> ■ ■ ■ > fz) T * s ^ e estimated gradient vector. 

To select the EDR directions, we focus on the empirical gradient covariance matrix defined below 

(fkJz)K] P ■ (17) 

The inner product (f z , f 3 z )ic can be interpreted as the covariance of the gradient functions between coordinate 
i and j. The larger the inner product is, the more related the variables x l and are. Given a unit vector 
u G M. p , the RKHS norm of the directional derivative ||u • fz\\ic can be viewed as a measure of the variation 
of the data Z along the direction u. Thus the direction Ui representing the largest variation in the data is 
the vector that maximizes ||u • fzWfc- Notice that 

iiu • fzwi = \\J2 u >fz\\l = E wsifafi)*: = u T su. 

i i-.j 

So Ui is simply the eigenvector of S corresponding to the largest eigenvalue. Similarly, to construct the second 
most important direction 112, we maximize ||u-/z||k; in the orthogonal complementary space of spanjui}. By 
Courant-Fischcr Minimax Theorem [35] , U2 is the eigenvector corresponding to the second largest eigenvalue 
of S. We repeat this procedure to construct other important directions. In summary, the effective dimension 
reduction directions are defined according to the following criterion. 

Definition 2. The d EDR directions identified by the sparse gradient learning are the eigenvectors {ui, . . . , u^} 
of S corresponding to the d largest eigenvalues. 

As we mentioned in section 2.1, the EDR space is spanned by the eigenvectors of the gradient outer 
product matrix G defined in Eq. (3). However, because the distribution of the data is unknown, G cannot 
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be calculated explicitly. The above definition provides a way to approximate the EDR directions based on 
the empirical gradient covariance matrix. 

Because of the sparsity of the estimated gradient functions, matrix S will appear to be block sparse. 
Consequently, the identified EDR directions will be sparse as well with non-zeros entries only at coordinates 
belonging to the set S. To emphasize the sparse property of both S and the identified EDR directions, we will 
refer to 3 as the sparse empirical gradient covariance matrix (S-EGCM), and the identified EDR directions 
as the sparse effective dimension reduction directions (S-EDRs). 

3. Convergence Analysis 

In this section, we will give the proof of Theorem 1 and Theorem 2. 
3.1. Convergence Analysis in the Euclidean Setting 

Note that our energy functional in (8) involves an nonsmooth regularization term J2i ||/ l ||jt- The method 
for the convergence analysis used in [22] can no longer be applied any more since it need explicit form of 
the solution which is only possible for the I 2 regularization. However, we can still simultaneously control a 
sample or estimation error term and a regularization or approximation error term which is widely used in 
statistical learning theory [23, 31, 36]. 

3.1.1. Comparative Analysis 

Recall the empirical error for a vector function f := (f 1 , . . . , f p ), 

1 n 

£z(f) = ^iJ2 w «(w - V] + /( x • ( x j - x *)) 2 - 

One can similarly define the expected error 

£(f)= [ [ oj s (x-u)(y-v + f(x)(u-x)) 2 dp(x,y)dp(u,v). 
J z J z 

Denote 

a l= \ / ujS ( x - u )(y - f P (x)) 2 dp(x,y)dp x (u). 
Jx Jz 

Then £(f) = 2a 2 + J x J x cj(x - u)[/ p (x) - f p (u) + /(x)(u - x)} 2 d Px (x)d Px (u). 

Note that our goal is to bound the L 2 x differences of / and V/ p . We have the following comparative 

theorem to bound the L 2 differences of / and V/ p in terms of the excess error, 8(f) — 2a 2 using the 
following comparative theorem. 
For r > 0, denote 

^ = {/e^:£||/%<r}. 
»=x 

Theorem 3. Assume px satisfies the condition (10) and (11) and V/ p G For f £ T r with some r > 1, 
there exist a constant Co > such that 

11/ - Vf p \\ L , x < C„(rV + s 2 ~° + -^(£(f) 2a 2 )). 

To prove Theorem 3, we need several lemmas which require the notations of the following quantities. 
Denote 

Q(f) = f [ W (x- u)((/(x) - V/ P (x))(u - X )) 2 d Px ( X )dp x (u), 
Jx Jx 
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the border set 

I s = {xel: d(x, dX) > s and p(x) > (1 + c p )s e } 
and the moments for < q < oo 

M g = / e-^Htfdt, M q = 



t <i 



Note that X s is nonempty when s is small enough. 
Lemma 1. Under assumptions of Theorem 3, 

M 2S P+ 2+e 



P Jx 



||/(x)-v/ p (x)||^(x)<g(/) 



Proof. For x € X s , we have d(x,dX) > s and p(x) > (1 + c p )s e . Thus {u e X : |u - x| < s} C X and for 
ue{uel:|u-x|<s}, p(u) = p(x) - (p(x) - p(u)) > (1 + c p )s 9 - c p |u - x| e > s e . Therefore, 



Q(/) > / / W s (x-u)((/(x)-V/ p (x))(x-u))^(u)dud Px (x) 

JX, J\\x—u\\<a 

> s 9 I I ^ s (x - u)((/(x) - V/ P (x))(x - xi)) 2 dudp x (x). 

Jx s J||x-u||<s 

Denote the i-th entry of a vector x by x % . Then ((/(x) — V/ P (x))(x — u)) 2 equals to 

E E(/ l w - §§( x ))(/ J ( x ) - B (x))(xi ~ uI)(aJ - 
i=i i=i 

For the case i ^ j, we have 

f w s (x - u)(x i - u l )(x J - u j )du = s p+2 [ e-^tH j dt = 0. 

Therefore, 

Q(f) > s p+2+e ir[ (r(x)-|4(x)) 2 dpx(x) / e-^iffdt 
T^Jx, ox 1 i\\t\\<\ 



i=i 



M 2 sP+ 2+e 



J ||/(x)-V/ p (x)!| 2 d Px (x) 



P Jx, 

which yields the desired estimate. □ 
Lemma 2. Under the assumption of Theorem 3, we have 

Q(f)<C 1 (s A+ P+£(f)-2a 2 ), 

where C\ is a constant independent of s or f. 

Proof. Denote ai = (/(x) - V/ P (x))(u - x) and a 2 = / p (x) - f p (u) + V/ P (x)(u - x). We have Q(f) = 
Ix Ix uS ( x _ u ) f d Px (*) d Px (u) and 



£(/)=/ / u: s (x-u)(a 1 +a 2 ydp x (x)dp x (u) + 2o-i. 
Jx Jx 

Note that («i + a 2 ) 2 > (ai) 2 - 2||ai|j||a 2 ||. Thus 

£{f) - 2a 2 s > Q(f) - 2 / / w*(x-u)||ai||||a2||dpx(x)dpx(u). 



'X Jx 
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By the fact V/ p <E and lemma 19 in [23], there exists a constant C/c > depending on JC and f p such 
that 

IKH <C K ||x-u|| 2 . 
Together with the assumption p(x) < c p , we have 

/ / w s (x - u)||ai||||a 2 ||d/9x(x)d/0A:(u) 

< \/Q(/5(/ / u s (x- U )\\a 2 fdp x (x)dpx(u))i 

J x Jx 

< C k Jq(J){c p I I W s (x-u)||x-u|| 4 rfxdpx(u))5 

JX JRP 

< C^^IT^+P^^Qif). (18) 
Combining the above arguments, we obtain 



Q(f) - C K ^M~ 4 s 2+ ^ 2 jQ(f) < £(f) - 2a 2 



This implies the conclusion with C\ = 2 max{ C^c p M^, 1}. □ 

Denote K = sup^gx y/ K(x, x), D = max x . ue x ||x — u|| . 
Proof of Theorem 3. Write 



PX 



Il/-V/J£ 2 =/ ||/(x)-V/ p (x)||^x(x)+ / ||/(x)-V/ p (x)|| 2 dpx(x). (19) 
>X\X S Jx s 



We have 

Px(X\X s ) < c p s + (1 + c p )c p \X\s e < (c p + (1 + c p )c p \X\)s , 
where \X\ is the Lebesgue measure of X. So the first term on the right of (19) is bounded by 

* 2 (r + ||V/ p |k) 2 (c p + (1 + c p )c p \X\)s e . 

By lemma 1 and lemma 2, the second term on the right hand of (19) is bounded by 

Combining these two estimates finishes the proof of the claim with 

Co = k 2 (1 + || V/ P |k) 2 (c p + (1 + c p )c p \X\) + V £±. 

This is the end of the proof. 
3.1.2. Error Decomposition 

Now we turn to bound the quantity £{fz) — 2tr 2 . Note that unlike the standard setting of regression and 
classification, £z{f) an d £{f) are not respectively the expected and empirical mean of a random variable. 
This is due to the extra dp(u, v) in the expected error term. However, since 

Ez£z(f) = —£(f), 



£z(f) and £(f) should be close to each other if the empirical error concentrates with n increasing. Thus, we 
can still decompose £(fz) — 2o~ 2 into a sample error term and an approximation error term. 
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Note that n(f) = A £\ ||k with / = (f 1 , /?), so the minimizer of £(f) + fi(/) in H£. depends on 
A. Let 

/a = arg min {£(/) + Q(f)}. (20) 

By a standard decomposition procedure, we have the following result. 
Proposition 1. Let 

<p(Z) = (£(f z ) - £ z (fz)) + (£ z (h)) - £{fx)) 

and 

A(X) = jnf {£(/) - 2al + Q(f)}. 

/em* 

Then, we have 

£{fz) - 2a 2 s < £{f z ) - 2a\ + Q(f z ) < <p(Z) + A(X) 
The quantity ip(Z) is called the sample error and A(X) is the approximation error. 

3.1.3. Sample Error Estimation 

Note that the sample error ip(Z) can be bounded by controlling 

S(Z,r) := sup \£ z (f) - £(f)\. 

In fact, if both f z and f\ are in T r for some r > 0, then 

<p(Z)<2S(Z,r). (21) 

We use McDiarmid's inequality in [37] to bound S(Z,r). 
Lemma 3. For every r > 0, 



Prob{\S(Z,r) - ES(Z,r)\ > e} <2cxp 



9 

ne 



32 (Af + nDr) 4 



Proof. Let (x',y') be a sample i.i.d drawn from the distribution p(x, y). Denote by Z[ the sample which 
coincides with Z except that the i-th entry (xj, y^) is replaced by (x', y'). It is easy to verify that 

S(Z, r) - S(Zl, r) = sup \£ z (f) - £(f)\ - sup \£ z ,(f) - £(f)\ 

< BU p \sM)-s z ,u)\< i{2,l - m l +KDTf - m 

Interchange the roles of Z and Z- gives 

\S(Z,r)-S(Zir)\< 8 M±^l. 

n 

By McDiarmid's inequality, we obtain the desired estimate. □ 

In order to bound S(Z, r) using Lemma 3, we need a bound of ES(Z, r). 
Lemma 4. For every r > 0, 

ES(Z,r)< n ^ M ^. 
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Proof. Denote £(x,y,u,v) = w s (x - u)(y - v + /(x)(u - x)). Then £(/) = E { ^ v) E {x , y) £(x,y,u 7 v)} and 

£z(f) = ^2 YJi,j=l £( x i> Vii X i:%)- 0ne Can easil y Cneck that 

1 ™ 

S^r) < sup ^^ (XjS) ^(x,y,Xj,y 3 -)| 

n j'=i 

1 " 



sup 



< sup E^ y) 



1 n 

^(u,t>)C(x,y,u, u) - -^$(x,y,x 3 , yi ) 



3=1 



1 

+ — > sup sup 

77 ^ ' - / \ _ . 



Ei 



1 " 

h)^(x,j/,u,u) — - £(Xj,2/i,U,v) 



:= 5!+5 2 + 5 3 . 

Let ej,i = 1, • • • , n be independent Radcmacher variables. For Si, by using the properties of Radcmacher 
complexities [38], we have 



ESi(Z) = -B( x , a ) sup 



< 2 sup E sup 



< 4(M + kDt) 
5(nDr + Mf 



1 " 

S( u ,„)^(x, y, u, v) Z ( x > ^ X J ' 

1 - - 

- e 3 ,u;S ( x _ x i)(%' - y + /( x j)( x - x j)) 2 

(1 - - M \ 

sup E sup - V ej (yj -y + f(x j )(x-x j )) + —\ 
(x, v )ez /gjv n j=1 V n y 



< 



Similarly, we can verify ES2(Z) < 5 ^ D ^t M " > . Obviously, S3 < i - AI+f ^ Dr '> _ Combining the estimates for 
Si, 5*2 and S3, we can get the desired estimate 

> lOUDr + M) 2 (M + nDr) 2 11(M + kDt) 2 
ES(Z,r) < -= j < -= . 



□ 



Proposition 2. Assume r > 1. There exists a constant C3 > such that with confidence at least 1 — 6, 

, (nDr + M) 2 log I 
V (Z) < C 3 K - ' &s . 

v n 

Proof. The result is a direct application of inequality (21), lemma 3 and lemma 4. 



□ 



Note that in order to use this Proposition, we still need a bound on Cl(fz) = II/zIIk- We nrs * state 
a rough bound. 

Lemma 5. for every s > and A > 0, Q(/z) < M 2 . 
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Proof. The conclusion follows from the fact 

< £z(fz) + n(f z ) < £ Z (Q) < M 2 . 

□ 

However, using this quantity the bound in Theorem 3 is at least of order 0( x2s 2p+4-i ) which tends to oo 
as s — > and A — > 0. So a sharper bound is needed. It will be given in Section 3.1.5. 

3.1.4- Approximation Error Estimation 

We now bound the approximation error „4(A). 

Proposition 3. J/V/ p £ H£, then A(X) < C 4 (A + s 4+ p) for some C 4 > 0. 
Proof. By the definition of -4(A) and the fact that V/ p £ H^, 

A(X)<S(Vf p )-2a 2 s +n(Vf p ). 

Since 

£(V/ p )-2a 2 = f f ^(x-u)(/ p (x)-/ p (u)+V/ p (x)(u-x)) 2 dp x (x)d Px (u) 

< (Cfcfcp f f w s (x-u)j|u-x|| 4 dudp A -(x) 

JX 

< (Cyc) 2 c p Af 4S 4+p . 

Taking C 4 = max{(CK;) 2 c p M4, J^i=i II (VfpTWic}, we get the desired result. □ 
3.1.5. Convergence rate 

Following directly from Proposition 1, Proposition 2 and Proposition 3, we get 

Theorem 4. // V/ p £ H^-, /g and /a are m J> /or some r > 1, /j/ien wii/i confidence 1 — 5 

W,)-2^<C 2 ( ( " + ^' 2 '°^ + ,^ + A), 

where C2 is a constant independent of r, s or A. 

In order to apply Theorem 3, we need a sharp bound on £l{fz) '■= AJ^ 
Lemma 6. Under the assumptions of Theorem 1, with confidence at least 1 — 5 

a (&SC5 (, + ^ + ( 1 + ^) 2 ^) 

for some C5 > independent of s or A. 

Proof. By the fact £{f z ) - 2cr 2 > and Proposition 1, we have tt(fz) < \{<p{ z ) + 
and f\ are in , using Proposition 2, we have with probability at least 1 — 5, 

X 

Together with Proposition 3, we obtain the desired estimate with C5 = max{C3, C4}. 
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. Since both fz 



□ 



Lemma 7. Under the assumptions of Theorem 1, 

fi(£) <C 4 (A + s 4+p ), 

where C4 is a constant independent of X or s. 

Proof. Since £(f\) — 2er 2 is no n- negative for all /, we have 

< £(/a) - 2a s 2 + AO(/ A ) = A(X). 

This in conjunction with proposition 3 implies the conclusion. □ 
Now we will use Theorem 3 and Theorem 4 to prove Theorem 1. 



Proof of Theorem 1: By Theorem 3 and Theorem 4, we have with at least probability 1 

f, fl 9fl C 2 / (M + nDr) 2 log § 
\\fz V /p ||l ?x < Co r 2 s° + ,s 2 - e + K - jJ- + + A 



5 



in 

if both fz and f\ are in J> for some r > 1. By lemma 7 and lemma 6, we can state that both fz and f\ 

s 
2 

2 fl/T2i„„4 



are in T r with probability 1 — | if 



A 



s i+ P ( kDMY MHog 

r=max \ 1+ —{ 1 + —) 

Choose s = (i) 2<!) +" +2£, » , A = (i) , we obtain with confidence 1 - 5, 

||/z-V/„|U» <C 



1 ^ 4(p + 2 + 28) 



3.2. Convergence Analysis in the Manifold Setting 



The convergence analysis in the Manifold Setting can be derived in a similar way as the one in the Euclidean 
setting. The idea behind the proof for the convergence of the gradient consists of simultaneously controlling 
a sample or estimation error term and a regularization or approximation error term. 

As done in the convergence analysis in the Euclidean setting, we first use the excess error, £(f) — 2a 2 , to 
bound the L 2 x differences of V x f P and (d<f>)*(/). 

Recall 

j;. = {/eH£:]T||f!k<r}, r>0. 

i=l 

Theorem 5. Assume px satisfies the condition (13) and (14) and Vx/p € C 2 (X). For f G J- r with some 
r > 1, there exist a constant Co > such that 

\\(d<f>)*(f) - Vxf P \\h px < C (r 2 s s + - 2a 2 )). 

Proof. It can be directly derived from lemma B.l in [25] by using the inequality J27=i \ v i\ 2 — E™=i l w il) 2 - D 



3.2.1. Excess Error Estimation 

In this subsection, we will bound £(fz) — 2tr 2 . First, 
approximation error. 



we decompose the excess error into sample error and 
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Proposition 4. Let fx be defined as (20), 

<p(Z) = (£(f z ) - £ z (fz)) + (£ z {h) - £{h)) 

and 

A{\)= mf \£{f)-2o 2 s +n{f)\. 

Then, we have 

£{fz) - 2a 2 + Sl(fz) < <p(Z) + A(X). 

Since the proof of Proposition 2 doesn't need any structure information of X, it is still true in the manifold 
setting. Thus we have the same sample error bound as the one in the Euclidean setting. What left is to give 
an estimate for the approximation error A(X) in the manifold setting. 

Proposition 5. Let X be a connected compact C°° submanifold of MP which is isometrically embedded and 
of dimension d. If f p G C 2 (X) and d<f>(V x fp) € then 

A(X)<C 6 (X + s i+d ) 

for some C$ > 0. 

Proof. By the definition of A(X) and the fact that d$(V x f P ) G H? K , 

A{X) < £ (d$(Vjc/ p )) - 2a 2 s + n(d$(Vx/,)). 

Note that f p G C 2 (X) and d${V x f p ) G H V K . By Lemma B.2 in [25], we have 

£(d$(Vxf P )) - 2a 2 < C 7 s 4+d , 

where C7 is a constant independent of s. Taking Cq = max{C7,^f =1 \\(d§(\7 x fp)) l \\ic}, we get the desired 
result. □ 

Combining Proposition 4, Proposition 2 and Proposition 5, we get the estimate for the excess error. 

Theorem 6. If d<f>(V f p ) G H^, fz and f\ are in T r for some r > 1, then with confidence 1 — S. 

( (M + KDr) 2 log I ... \ 
£{fz) - 2a 2 < C 8 ^ Ba + s d+i + Xj , 

where C$ is a constant independent of s, A, 6 or r. 
3.2.2. Convergence Rate 

In order to use Theorem 5 and Theorem 6, we need sharp estimations for X)f=i II (d$>(V x fp)) l \\ic an d 
Sf=i II/aII'C- This can be done using the same argument as the one in the Euclidean setting, we omit 
the proof here. 

Lemma 8. Under the assumptions of Theorem 2, with confidence at least 1 — 8, 
and 

n(h)<c 9 (x + s 4+d ), 

where Cg is a constant independent of X or s. 
Now we prove Theorem 2. 

Proof of Theorem 2: By the same argument as the one in proving Theorem 1, we can derive the 
convergence rate using Theorem 5, Theorem 6 and Lemma 8. 
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4. Algorithm for solving sparse gradient learning 

In this section, we describe how to solve the optimization problem in Eq. (8). Our overall strategy is to 
first transfer the convex functional from the infinite dimensional to a finite dimensional space by using 
the reproducing property of RHKS, and then develop a forward-backward splitting algorithm to solve the 
reduced finite dimensional problem. 



4-1- From infinite dimensional to finite dimensional optimization 

Let K, : M. p x K p — > M. p be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct 
points {xi, ■ • • ,x n } C W p , the matrix [K.(x.i, Xj)]™. =1 is positive semidefinite [27]. Such a function is called 
a Mercer kernel. The RKHS H^; associated with the Mercer kernel JC is defined to be the completion of 
the linear span of the set of functions {/C x := /C(x, •) : x G R n } with the inner product (•, satisfying 
(JCx.) fcu)K = ^( x , u). The reproducing property of Hyc states that 

(£*, h) K = h{x) Vx eW,he U K . (23) 

By the reproducing property (23), we have the following representer theorem, which states that the 
solution of (8) exists and lies in the finite dimensional space spanned by {/C Xi }™=i- Hence the sparse gradient 
learning in Eq. (8) can be converted into a finite dimensional optimization problem. The proof of the theorem 
is standard and follows the same line as done in [22, 39]. 

Theorem 7. Given a data set Z, the solution of Eq. (8) exists and takes the following form 

n 

/i(x) = £4 z /C(x,x,), (24) 
i=l 

where c? z G K for j = 1 , . . . , p and i = 1 , . . . , n. 

Proof. The existence follows from the convexity of functionals £z(f) an d ^(/)- Suppose fz is a minimizer. 
We can write functions fz € % P K as 

fz = f\\+f±, 

where each element of /y is in the span of { A Xl , • • • , K Xn } and f± are functions in the orthogonal complement. 
The reproducing property yields /(x,) = /||(xj) for all Xj. So the functions f± do not have an effect on 
£z(f)- But \\fz\\i< = ||/|| + I±\\k > unless f± = 0. This implies that fz = f\\, which leads to the 

representation of fz in Eq. (24). □ 

Using Theorem 7, we can transfer the infinite dimensional minimization problem (8) to an finite dimen- 
sional one. Define the matrix Cz '■= [cj z ] j=i i—\ € M px ™. Therefore, the optimization problem in (8) has 
only p x n degrees of freedom, and is actually an optimization problem in terms of a coefficient matrix 
C := [c?']f=i,i =1 e RPX "- Write C int0 column vectors as C := (ci, . . . ,c„) with c 4 e MP for i = 1, • ■ ■ , n, 
and into row vectors as C := (c 1 ,...,c p ) T with c J G M" for j = 1, • • ■ ,p. Let the kernel matrix be 
K := [£(x,;,Xj)]™l™ - =1 G R™ xn . After expanding each component / 3 of / in (8) as / 3 (x) = X)"=i c i^-( x ' x i)i 
the objective function in Eq. (8) becomes a function of C as 

<&(c) = ^(/) + n(/) 

- n p n p 

Z2 E <,(y,-y J + EE c " /c ( x ^ x ^)(4-^)) 2 + A E 



ij'=l fc=l 1=1 



= 1 \ i,k=l 



-2 E <j (» - vj + E x *)( x j - x T c0 2 + A E \f^VK^ 

n i,j=i e=i j=l 
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1 ™ p / 

-5 2 <(W - % + - ^) T Ck 2 ) 2 + y/WTKcfi, (25) 



where k 4 G 1" is the i-th column of K, i.e., K = (k 1; . . . , k„). Then, by Theorem 7, 

C z = arg min $(C). (26) 

^.2. Change of optimization variables 

The objective function $(C) in the reduced finite dimensional problem convex is a non-smooth function. 
As such, most of the standard convex optimization techniques, such as gradient descent, Newton's method, 
etc, cannot be directly applied. We will instead develop a forward-backward splitting algorithm to solve the 
problem. For this purpose, we fist convert the problem into a simpler form by changing the optimization 
variables. 

Note that K is symmetric and positive semidefinite, so its square root K 1 / 2 is also symmetric and pos- 
itive semidefinite, and can be easily calculated. Denote the i-th column of K 1 ^ 2 by k^ 2 , i.e., K 1 ! 2 = 
(k^ 2 , . . . , k^ 2 ). Let C = CK 1 / 2 and write C = (c 1; . . . ,c„) = (c 1 , . . . ,c p ) T , where and cP are the i-th 
column vector and j-th row vector respectively. Then <&(C) in Eq. (25) can be rewritten as a function of C 

-, n p 

= ^ E " VJ + - **) r C*i V3 ) a + A E H^'lla. ( 27 ) 

».J=1 3=1 

where || • H2 is the Euclidean norm of K p . Thus finding a solution C2 of (26) is equivalent to identifying 

C z = arg min *(C), (28) 



followed by setting Cz = C^if ^-Z 2 , where if 1 / 2 is the (pseudo) inverse of K 1 / 2 when if is (not) invertiblc. 
Given matrix Cz, the variables selected by the sparse gradient learning as defined in Eq. (16) is simply 

S = {j: \\V\\ 2 ^0,j = l,--- ,n}. (29) 

And similarly, the S-EDR directions can also be directly derived from Cz by noting that the sparse gradient 
covariance matrix is equal to 

3 = ClKC z = ClCz- (30) 



4-3. Forward- backward splitting algorithm 

Next we propose a forward-backward splitting to solve Eq. (28). The forward-backward splitting is commonly 
used to solve the £1 related optimization problems in machine learning [40] and image processing [29, 41]. 
Our algorithm is derived from the general formulation described in [42] . 

We first split the objective function <]/ into a smooth term and a non-smooth term. Let VP = "fi + ^2, 
where 

P 1 n 

= A£ ||c*|| 2 and * a (0 = ^ E <j(Vi % + ( x j - xO T Ck 4 1/2 ) 2 . 

i—l 

The forward-backward splitting algorithm works by iteratively updating C. Given a current estimate C^ k \ 
the next one is updated according to 

C^ = prox 5vti (C« - <JV* 2 (C (fc) )), (31) 
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where 5 > is the step size, and prox 5 ^ i is a proximity operator defined by 

1 



prox^p) =arg^min -\\D - C\\% + <^i(<7), (32) 



where || • \\f is the Frobenius norm of K px ". 

To implement the algorithm (31), we need to know both V^2 and prox 5 ^, i (■). The term V^2 is relatively 
easy to obtain, 

_ 9 " 

Vv W) = ^ E Vi + - x 2 ) T Ck- /2 )(x, - xO(kl /2 f. (33) 

The proximity operator prox (5 ^ i is given in the following lemma. 

Lemma 9. Let T\$(D) = prox 5 ^, i (D), where D = (d 1 , ...,d p ) T with d J being the j-th row vector of D. 
Then 

Txs(D)= (< A5 (d 1 ),...,i A5 (d p )) T , (34) 

where 

(O, if \\di\\ 2 <\8, 

txs{d3) = iW idJ '. * i' d ^> A *- (35) 

Proof. From (32), one can easily see that the row vectors cP , j = 1, . . . , n, of C are independent of each 
others. Therefore, we have 

txsid ) = arg min -||d J - c|| 2 + \5\\c\\ 2 . (36) 

c6R" Z 

The energy function in the above minimization problem is strongly convex, hence has a unique minimizer. 
Therefore, by the subdiffercntial calculus (c.f. [43]), t\s{d^) is the unique solution of the following equation 
with unknown c 

0ec-d J +A^(||c|| 2 ), (37) 

where 

0(||c|| 2 ) ={p:pe K"; ||u|| 2 - ||c|| 2 - (u - c) T P > 0, Vu e R"} 

is the subdiffercntial of the function ||c|| 2 . If ||c|| 2 > 0, the function ||c|| 2 is differentiablc, and its subdifferen- 
tial contains only its gradient, i.e., <9(||c|j 2 ) = {^y^}- 11 || c l|2 = 0, then 9(||c|| 2 ) = {p : p £ M"; ||u|| 2 — u T p > 
0, Vu £ R™}. One can check that 9(||c|| 2 ) = {p : p £ M. n ; ||p|| 2 < 1} for this case. Indeed, for any vector 
p £ R" with ||p|| 2 < 1, ||u|| 2 — u T p > by the Cauchy- Schwartz inequality. On the other hand, if there is an 
element p of i9(||c|| 2 ) such that ||p|| 2 > 1, then, by setting u = p, we get ||p|| 2 - p T p = ||p||2(l — 1 1 p» | j 2 ) < 0, 
which contradicts the definition of 9(||c|| 2 ). In summary, 

9(||c||,) = {W' if I|C||2> °' (38) 

\{p:p£M»;||p|| 2 <l}, if ||c|| 2 = 0. 

With (38), we see that t\s(d : ') in (35) is a solution of (37) hence (34) is verified. □ 

Now, we obtain the following forward-backward splitting algorithm to find the optimal C in Eq. (26). 
After choosing a random initialization, we update C iteratively until convergence according to 

The iteration alternates between two steps: 1) an empirical error minimization step, which minimizes the 
empirical error £z(f) along gradient descent directions; and 2) a variable selection step, implemented by the 
proximity operator T\s defined in (34). If the norm of the j-th row of D^ k \ or correspondingly the norm 
||/ J '||;c of the j-th partial derivative, is smaller than a threshold A<5, the j-th row of D' fc ' will be set to 0, i.e., 
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the j-th variable is not selected. Otherwise, the j-th row of will be kept unchanged except to reduce its 
norm by the threshold A<5. 

Since ^(C) is a quadratic function of the entries of C, the operator norm of its Hessian ||V 2 4 , 2|| is a 
constant. Furthermore, since the function ^ is coercive, i.e., ||C||f — >• oo implies that ^(C) — > oo, there 
exists at least one solution of (28). By applying the convergence theory for the forward-backward splitting 
algorithm in [42], we obtain the following theorem. 

Theorem 8. If < S < || v^,| | > then the iteration (39) is guaranteed to converge to a solution of Eq. (28) 
for any initialization . 

The regularization parameter A controls the sparsity of the optimal solution. When A = 0, no sparsity 
constraint is imposed, and all variables will be selected. On the other extreme, when A is sufficiently large, 
the optimal solution will be C = 0, and correspondingly none of the variables will be selected. The following 
theorem provides an upper bound of A above which no variables will be selected. In practice, we choose A to 
be a number between and the upper bound usually through cross-validation. 

Theorem 9. Consider the sparse gradient learning in Eq. (28). Let 



max — 

i<fe<p n 2 



E 



(40) 



Then the optimal solution is C = for all A > \ m ax, that is, none of the variables will be selected. 

Proof. Obviously, if A — oo, the minimizer of Eq. (27) is a p x n zero matrix. 

When A < oo, the minimizer of Eq. (27) could also be a p x n zero matrix as long as A is large enough. 
Actually, from iteration (39), if we choose C' - 1 = 0, then 



D 



(i) 



2/ J )(x J -x,)(kf) 5 



and C« = T X5 (DW). 
Let 



max — 

i<fc<p n 1 



% )(x*-x*)(k. 



2\T 



Then for any A > X ma x, we have C*- 1 -' = pxn by the definition of T\s- By induction, = Opxn and the 



algorithm converge to 
desired result. 



Opxn which is a minimizer of Eq. (27) when < 8 < 



V 2 *2 



We get the 
□ 



Remark 4. In the proof of Theorem 9, we choose = pX n as the initial value of iteration (39) for 
simplicity. Actually, our argument is true for any initial value as long as < 8 < ||ya 2 ^ 2 || since the algorithm 
converges to the minimizer of Eq. (27) when < 5 < ■ Note that the convergence is independent of 

the choice of the initial value. 



It is not the first time to combine an iterative algorithm with a thresholding step to derive solutions 
with sparsity (see, e.g., [29]). However, different from the previous work, the sparsity we focus here is a 
block sparsity, that is, the row vectors of C (corresponding to partial derivatives f 3 ) are zero or nonzero 
vector-wise. As such, the thresholding step in (34) is performed row-vector-wise, not entry-wise as in the 
usual soft-thresholding operator [28]. 



4-4- Matrix size reduction 

The iteration in Eq. (39) involves a weighted summation of n 2 number ofpxn matrices as defined by 
(x J -x ? ;)(k t 1/2 ) T . When the dimension of the data is large, these matrices are big, and could greatly influence 
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the efficiency of the algorithm. However, if the number of samples is small, that is, when n « p, we can 
improve the efficiency of the algorithm by introducing a transformation to reduce the size of these matrices. 
The main motivation is to note that the matrix 



(xi X n , X2 X n , . . . , X n _ i X n , X^ 



is of low rank when n is small. Suppose the rank of M x is t, which is no higher than min(n — 

We use singular value decomposition to matrix M x with economy size. That is, M x = UY^V T , where U is 
apxn unitary matrix, V is n x n unitary matrix, and E = diag(<7i, . . . , at, 0, . . . , 0) <G M™ x ". Let (3 = Y,V T , 
then 

M x = UP. (41) 
Denote /3 = {(3\, . . . , (3 n ). Then Xj — x, = U (j3j —Pi)- Using these notations, the equation (39) is equivalent 

to 

ffltW-D = cW _ 2tu E ^. =1< .( yi - w + (xj _ x l ) T C( fc )k l 1 / 2 )( / 3 J - ft)(kl /2 ) T , U9] 
y c (k+i) = Txs ( D (k+i)y ^ 

Note that now the second term in the right hand side of the first equation in (42) involves the summation 
of n 2 number of n x n matrix rather than p x n matrices. Furthermore, we calculate the first iteration of 
Eq. (42) using two steps: 1) we calculate yi — yj + (xj — x^) 7 C^kJ and store it in an n x n matrix r; 
2) we calculate the first iteration of Eq. (42) using the value r(i, j). These two strategies greatly improve 
the efficiency of the algorithm when p >> n. More specifically, we reduce the update for in Eq. (39) of 
complexity 0(n 3 p) into a problem of complexity 0{n 2 p + n A ). A detailed implementation of the algorithm 
is shown in Algorithm 1. 

Remark 5. Each update in Eq. (39) involves the summation of n 2 terms, which could be inefficient for 
datasets with large number of samples. A strategy to reduce the number of computations is to use a truncated 
weight function, e.g., 

w - = { exp(-te^), Xj G Aff, (43) 
0, otherwise, 

where = {x^- : Xj is in the k nearest neighborhood o/Xi}. This can reduce the number of summations 
from n 2 to kn. 



5. Sparse gradient learning for classification 

In this section, we extend the sparse gradient learning algorithm from regression to classification problems. 
We will also briefly introduce an implementation. 

5.1. Defining objective function 

Let x and y G { — 1, 1} be respectively M p -valued and binary random variables. The problem of classification 
is to estimate a classification function /c(x) from a set of observations Z := {(x^ ?/i)}™ =1 , where Xj := 
{x\, . . . , x v i ) T € W is an input, and j/j e { — 1,1} is the corresponding output. A real valued function 
f£ : X n- R can be used to generate a classifier /c(x) = sgn(f^(x)), where 

Similar to regression, we also define an objective function, including a data fitting term and a rcgularization 
term, to learn the gradient of /J. For classical binary classification, we commonly use a convex loss function 
4>(t) = log(l + e~*) to learn ff and define the data fitting term to be \ Ya=i ^(Uifpi^i))- The usage of 
loss function <p(t) is mainly motivated by the fact that the optimal f£ (x) = \og[P(y = l|x)/P(y = — l|x)], 
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Algorithm 1: Forward-backward splitting algorithm to solve sparse gradient learning for regression. 



Input: data {xi,j/i}^_ 1 , kernel /C(x, y), weight function o; s (x, y), parameters <5, A and matrix C' '. 
Output: the selected variables S and S-EDRs. 

1. Compute K, A" 1 / 2 . Do the singular value decomposition with economy size for the matrix M x = (xi — 
x„,...,x„ - x„) and get M x = £/EV T . Denote /3 = (/3i, . . . , fi n ) = EV T . Compute Gy = w|j(j8j - 

/3i)(kV 2 ) T ,i = 1, ...,n,j = l,...,n and let fc = 0. 

2. While the convergence condition is not true do 

(a) Compute the residual r-W = (ry ) S R nxn , where = Hi — Vj + (xj — x 4 ) T C' fc 'kj"' /2 . 

(b) Compute gW = % £^ = 1 P^G«. 

(c) Set Z)M = (5< fe ) - <5[/g( fe ). For the row vectors (d 8 )( fe ', i = 1,... ,p, of D^ k \ perform the variable 
selection procedure according to (35) to get row vectors (c 1 )'*^ 1 ) of C^"*" 1 '. 

i. If Kd 1 )'*^ ||2 < A<5, the variable is not selected, and we set (c')( fc + 1 ' = 0. 

ii. If ||(d*)'*'* ||a > A<5, the variable is selected, and we set 

(dM(Hl) = ll(d') (fc) l|2-A^ , (k) 

(d) Update C(M-l) = ((c 1 )( fe+1 ), . . . , (c ?l )( fc+1 )) T , and set k = k + 1. 
end while 

3. Variable selection: S = {i : (c*)( fc+1 ) ^ 0}. 

4. Feature extraction: let S-EGCM S = C'' fe + 1 ' ■ (C(' s + 1 ))' r and compute its eigenvectors via singular value 
decomposition of C' fc+1 ', we get the desired S-EDRs. 



representing the log odds ratio between the two posterior probabilities. Note that the gradient of exists 
under very mild conditions. 

As in the case of regression, we use the first order Taylor expansion to approximate the classification 
function/* by /*(x) « /*(x )+V//(x )-(x-x ). When x, is close to x i; //( Xi ) « /°(x i )+/(x i )- (x,— Xi), 
where / := {f 1 ,--- , f p ) with p = djf/dx^ for j = 1, ••• ,p, and /° is a new function introduced to 
approximate ffi(Xj). The introduction of f° is unavoidable since j/j is valued —1 or 1 and not a good 
approximation of /* at all. After considering Taylor expansion between all pairs of samples, we define the 
following empirical error term for classification 

1 n 

4(/°> /) == ~ 2 E + /(*) • ( Xj - X,))), (44) 

where o;| ■ is the weight function as in (5). 
For the regularization term, we introduce 

^/ ,/) = A 1 ||/ ||^ + A 2 E||.f|k. (45) 

i=l 

Comparing with the regularization term for regression, we have included an extra term Ai||/ ||^- to control 
the smoothness of the / function. We use two regularization parameters \\ and A2 for the trade-off between 
ll/°lli;and£?=ilirik. 

Combining the data fidelity term and regularization term, we formulate the sparse gradient learning for 
classification as follows 

(/!,/l) = arg min £+(f°, /) + Q(/°, /). (46) 

(/°J)6H£ +1 
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5.2. Forward-backward splitting for classification 



Using representer theorem, the minimizer of the infinite dimensional optimization problem in Eq. (46) has 
the following finite dimensional representation 

n n 
i=l i=l 

where a^zi c% z G K for i = 1, . . . ,n and j = 1, . . . ,p. 

Then using the same technique as in the regression setting, the objective functional in minimization 
problem (46) can be reformulated as a finite dimensional convex function of vector a = (qi,...,q„) t and 
matrix C = (c^)^ - =1 . That is, 

1 " _ i p 

*(a,C) = ^Y1 <j<t>(yi(<x T ki + (xj -Xi) T Ckf )) + A ia T /fa+ A 2 ^||F|| 2 . 

Then the corresponding finite dimensional convex 

(4,C|) = arg min tf(C) (47) 

can be solved by the forward-backward splitting algorithm. 

We split *(a,C) = *i + * 2 with * x = A a £? =1 P'lh and * 2 - ^E" F i w yfe(« Tk . + ( x j - 
Xj) T Ck J 2 )) + \ia T Ka. Then the forward-backward splitting algorithm for solving (47) becomes 



o 



= <*(*> -S[X £™ z} ^ 1 r- + 2X 1 Ka^ 

l+oxp(y J ((a( fc )) T k i + (x J -x I ) T C< fc !k i 2 )) 

1/2-.T 



n(fe+i) — r>0) _ su -^t.jViWi-PiKK ) (48) 

[C( fe + 1 )=T X2 ,( J D(^)) J 



"* l+cxp( W ((Q< fc )) T k I + (x J -x i ) T C( fc )k l 2 )) 



where t/, /? satisfy equation (41) with U being a p x n unitary matrix. 

With the derived C^, we can do variable selection and dimension reduction as done for the regression 
setting. We omit the details here. 



6. Examples 

Next we illustrate the effectiveness of variable selection and dimension reduction by sparse gradient learning 
algorithm (SGL) on both artificial datasets and a gene expression dataset. As our method is a kernel-based 
method, known to be effective for nonlinear problems, we focus our experiments on nonlinear settings for 
the artificial datasets, although the method can be equally well applied to linear problems. 

Before we report the detailed results, we would like to mention that our forward-backward splitting 
algorithm is very efficient for solving the sparse gradient learning problem. For the simulation studies, it 
takes only a few minutes to obtain the results to be described next. For the gene expression data involving 
7129 variables, it takes less than two minutes to learn the optimal gradient functions on a modest desktop. 



6. 1 . Simulated data for regression 

In this example, we illustrate the utility of sparse gradient learning for variable selection by comparing it 
to the popular variable selection method LASSO. We pointed out in section 2 that LASSO, assuming the 
prediction function is linear, can be viewed as a special case of sparse gradient learning. Because sparse 
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Table 1 

Frequencies of variables x 1 , x 2 , . . . , x 10 selected by SGL and LASSO in 100 repeats 



variable 


x l 


X 2 


X* 


X 4 


x b 


2» 


x' 


SB* 


x y 


x w 


SGL 


78 


100 


100 


100 


100 


7 


4 


6 


5 


2 


LASSO 


16 


100 


100 


100 


100 


25 


14 


13 


13 
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gradient learning makes no assumption on the linearity of the prediction function, we expect it to be better 
equipped than LASSO for selecting variables with nonlinear responses. 
We simulate 100 observations from the model 

y = (2.T 1 - l) 2 + x 2 + x 3 + x 4 + x 5 + e, 

where x l ,i = 1,...,5 are i.i.d. drawn from uniform distribution on [0,1] and e is drawn form standard 
normal distribution with variance 0.05. Let x l , i = 6, . . . , 10 be additional five noisy variables, which are also 

1.1. d. drawn from uniform distribution on [0, 1]. We assume the observation dataset is given in the form of 
Z := {x», Ui}\Ti, where Xj = (x},x 2 ,..., xj°) and yt = (2x\ - l) 2 + x 2 + x\ + xf + x\ + e. It is easy to see 
that only the first 5 variables contribute the value of y. 

This is a well-known example as pointed out by B. Turlach in [6] to show the deficiency of LASSO. As 
the ten variables are uncorrelated, LASSO will select variables based on their correlation with the response 
variable y. However, because (2a: 1 — l) 2 is a symmetric function with respect to symmetric axis x 1 = h 
and the variable x l is drawn from a uniform distribution on [0,1], the correlation between x 1 and y is 0. 
Consequently, x 1 will not be selected by LASSO. Because SGL selects variables based on the norm of the 
gradient functions, it has no such a limitation. 

To run the SGL algorithm in this example, we use the truncated Gaussian in Eq. (43) with 10 neighbors as 
our weight function. The bandwidth parameter s is chosen to be half of the median of the pairwisc distances 
of the sampling points. As the gradients of the regression function with respect to different variables are all 
linear, we choose /C(x, y) = 1 + xy. 

Figure 1 shows the variables selected by SGL and LASSO for the same dataset when the regularization 
parameter varies. Both methods are able to successfully select the four linear variables (i.e. x 2 ,--- , cc 4 ). 
However, LASSO failed to select x 1 and treated x 1 as if it were one of five noisy term x 6 , • • • , x 10 (Fig. lb). 
In contrast, SGL is clearly able to differentiate x 1 from the group of five noisy variables (Fig. la). 

To summarize how often each variable will be selected, we repeat the simulation 100 times. For each 
simulation, we choose a regularization parameter so that each algorithm returns exactly five variables. Table 
1 shows the frequencies of variables x 1 , x 2 , . . . , x 10 selected by SGL and LASSO in 100 repeats. Both methods 
are able to select the four linear variables, x 2 , x 3 , x 4 , x 5 , correctly. But, LASSO fails to select x 1 and treats 
it as the same as the noisy variables . This is in contrast to SGL, which is able to correctly 

select x 1 in 78% of the times, much greater than the frequencies (median 5%) of selecting the noisy variables. 
This example illustrates the advantage of SGL for variable selection in nonlinear settings. 

6.2. Simulated data for classification 

Next we apply SGL to an artificial dataset that has been commonly used to test the efficiency of dimension 
reduction methods in the literature. We consider a binary classification problem in which the sample data 
are lying in a 200 dimensional space with only the first 2 dimensions being relevant for classification and 
the remaining variables being noises. More specifically, we generate 40 samples with half from +1 class 
and the other half from —1 class. For the samples from +1 class, the first 2-dimensions of the sample data 
correspond to points drawn uniformly from a 2-dimensional spherical surface with radius 3. The remaining 
198 dimensions are noisy variables with each variable being i.i.d drawn from Gaussian distribution N(0,a). 
That is, 

x j - N[0,a), for j = 3,4, ...,200. (49) 

For the samples from —1 class, the first 2-dimensions of the sample data correspond to points drawn uniformly 
from a 2-dimensional spherical surface with radius 3x2.5 and the remaining 198 dimensions arc noisy variables 
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regularization parameter -log^ 

(a) 



LASSO parameter t 

(b) 



Fig 1: Regularization path for SGL and LASSO. Red line represents the variable x , blue lines represent the variables 
° and green lines represent noisy variables . (sl)Hk norm of each partial derivatives derived 

by SGL with respect to regularization parameter, where regularization parameter is scaled to be — log A with base 
10. (b)LASSO shrinkage of coefficients with respect to LASSO parameter t. 

with each variable x^ i.i.d drawn from 7V(0,tr) as (49). Obviously, this data set can be easily separated by a 
sphere surface if we project the data to the Euclidean space spanned by the first two dimensions. 

In what follows, we illustrate the effectiveness of SGL on this data set for both variable selection and 
dimension reduction. In implementing SGL, both the weight function and the kernel are all chosen to be 
exp(— ^ x ^"^ ) with s being half of the median of pairwise distance of the sampling points. 

We generated several datasets with different noise levels by varying a from 0.1 to 3. SGL correctly selected 
x 1 and x 2 as the important variables for all cases we tested. Furthermore, SGL also generated two S-EDRs 
that captured the underlying data structure for all these cases (Figure 2). It is important to emphasize 
that the two S-EDRs generated by SGL are the only two features the algorithm can possibly obtain, since 
the derived S-EGCM are supported on a 2 x 2 matrix. As a result, both of the derived S-EDRs are linear 
combinations of the first two variables. By contrast, using the gradient learning method (GL) reported in 
[25], the first two returned dimension reduction directions (called ESFs) arc shown to be able to capture the 
correct underlying structure only when a < 0.7. In addition, the derived ESFs are linear combinations of all 
200 original variables instead of only two variables as in S-EDRs. Figure 2(b,e) shows the training data and 
the test data projected on the derived two S-EDRs for a dataset with large noise (er = 3). Comparing to the 
data projected on the first two dimensions (Figure 2(a)(d)), the derived S-EDRs preserves the structure of 
the original data. In contrast, the gradient learning algorithm without sparsity constraint performed much 
poorer (Figure 2(c)(f)). 

To explain why SGL performed better than GL without sparsity constraint, we plotted the norms of 
the derived empirical gradients from both methods in Figure 3. Note that although the norms of partial 
derivatives of unimportant variables derived from the method without sparsity constraint are small, they are 
not exactly zero. As a result, all variables contributed and, consequently, introduced noise to the empirical 
gradient covariance matrix (Figure 3(e)(f)). 

We also tested LASSO for this artificial data set, and not surprisingly it failed to identify the right variables 
in all cases we tested. We omit the details here. 

6.3. Leukemia classification 

Next we apply SGL to do variable selection and dimension reduction on gene expression data. A gene 
expression data typically consists of the expression values of tens of thousands of mRNAs from a small 
number of samples as measured by microarrays. Because of the large number of genes involved, the variable 
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training data pr 

(a) 



aining data projected on two S-EDRs 



(b) 



training data projected on the first two ESFs 

(c) 



+ +1 

* -1 



+ , 



(d) 



test data projected on two S-EDRs 



test data projected on the first two ESFs 



(O 



(f) 



Fig 2: Nonlinear classification simulation with a = 3. (a) Training data projected on the first two dimensions, (b) 
Training data projected on two S-EDRs derived by SGL. (c)Training data projected on first two ESFs derived by 
GL. (d) Test data projected on the first two dimensions, (e) Test data projected on two S-EDRs derived by SGL. (f) 
Test data projected on first two ESFs derived by GL. 
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Fig 3: Nonlinear classification simulation with a = 3 (continued), (a) RKHS norm of empirical gradient derived by 
SGL. (b) S-EGCM for first 10 dimension, (c) Eigenvalues of S-EGCM. (d) RKHS norm of empirical gradient derived 
by GL, (e) EGCM for first 10 dimension, (f) Eigenvalues of EGCM. 
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selection step becomes especially important both for the purpose of generating better prediction models, and 
also for elucidating biological mechanisms underlying the data. 

The gene expression data we will use is a widely studied dataset, consisting of the measurements of 7129 
genes from 72 acute leukemia samples [44]. The samples are labeled with two leukemia types according to 
the precursor of the tumor cells - one is called acute lymphoblastic leukemia (ALL), and the other one is 
called acute myelogenous leukemia (AML). The two tumor types are difficult to distinguish morphologically, 
and the gene expression data is used to build a classifier to classify these two types. 

Among 72 samples, 38 are training data and 34 are test data. We coded the type of leukaemia as a binary 
response variable y, with 1 and —1 representing ALL and AML respectively. The variables in the training 
samples {x^}?^ are normalized to be zero mean and unit length for each gene. The test data are similarly 
normalized, but only using the empirical mean and variance of the training data. 

We applied three methods (SGL, GL and LASSO) to the dataset to select variables and extract the 
dimension reduction directions. To compare the performance of the three methods, we used linear SVM to 
build a classifier based on the variables or features returned by each method, and evaluated the classification 
performance using both leave-one-out (LOO) error on the training data and the testing error. To implement 
SGL, the bandwidth parameter s is chosen to be half of the median of the pairwise distances of the sampling 
points, and /C(x, y) = xy. The regularization parameters for the three methods arc all chosen according to 
their prediction power measured by leave-one-out error. 



Table 2 

Summary of the Leukemia classification results 



Method 


SGL (variable selection) 


SGL(S-EDRs) 


GL(ESFs) 


Linear SVM 


LASSO 


number of variables or features 


106 


1 


6 


7129(all) 


33 


leave one out error (LOO) 


0/38 


0/38 


0/38 


3/38 


1/38 


test errors 


0/34 


0/34 


2/34 


2/34 


1/34 



Table 2 shows the results of the three methods. We implemented two SVM classifiers for SGL using cither 
only the variables or the features returned by SGL. Both classifiers are able to achieve perfect classification 
for both leave-one-out and testing samples. The performance of SGL is better than both GL and LASSO, 
although only slightly. All three methods performed significantly better than the SVM classifier built directly 
from the raw data. 

In addition to the differences in prediction performance, we note a few other observations. First, SGL 
selects more genes than LASSO, which likely reflects the failure of LASSO to choose genes with nonlinear 
relationships with the response variable, as we illustrated in our first example. Second, The S-EDRs derived 
by SGL are linear combinations of 106 selected variables rather than all original variables as in the case of 
ESFs derived by GL. This is a desirable property since an important goal of the gene expression analysis is 
to identify regulatory pathways underlying the data, e.g. those distinguishing the two types of tumors. By 
associating only a small number of genes, S-EDRs provide better and more manageable candidate pathways 
for further experimental testing. 

7. Discussion 

Variable selection and dimension reduction are two common strategies for high-dimensional data analysis. 
Although many methods have been proposed before for variable selection or dimension reduction, few meth- 
ods are currently available for simultaneous variable selection and dimension reduction. In this work, we 
described a sparse gradient learning algorithm that integrates automatic variable selection and dimension 
reduction into the same optimization framework. The algorithm can be viewed as a generalization of LASSO 
from linear to non-linear variable selection, and a generalization of the OPG method for learning EDR di- 
rections from a non-regularized to regularized estimation. We showed that the integrated framework offers 
several advantages over the previous methods by using both simulated and real-world examples. 

The SGL method can be refined by using an adaptive weight function rather than a fixed one as in 
our current implementation. The weight function or? ■ is used to measure the distance between two sample 
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points. If the data are lying in a lower dimensional space, the distance would be more accurately captured by 
using only variables related to the lower dimensional space rather than all variables. One way to implement 
this is to calculate the distance using only selected variables. Note that the forward-backward splitting 
algorithm eliminates variables at each step of the iteration. We can thus use an adaptive weight function 
that calculates the distances based only on selected variables returned after each iteration. More specifically, 
let S^ k ' — {i : ||(c l )( fe )|| 2 7^ 0} represent the variables selected after iteration k. An adaptive approach is to 
use J2i e g(k) (x\ — x l j) 2 to measure the distance ||xj — x 3 || 2 after iteration k. 

An interesting area for future research is to extend SGL for semi-supervised learning. In many applications, 
it is often much easier to obtain unlabeled data with a larger sample size u >> n. Most natural (human or 
animal) learning seems to occur in semi-supervised settings [45]. It is possible to extend SGL for the semi- 
supervised learning along several directions. One way is to use the unlabeled data X = {Xi}£j^* +1 to control 
the approximate norm of / in some Sobolev spaces and introduce a semi-supervised learning algorithm as 

f 1 " 

fz,x,\,n = ar S m in \ ~ Y] u ij(Vi ~ Vi + /( x ' ( X J ~ x 0) 2 

n+u n 

+ 7^)2 E W iJf(*i) f(*j)\\% m + WlA, 

where \\/\\k = Y^=i WPWk, Wij are edge weights in the data adjacency graph, fi is another regularization 
parameter and often satisfies A = o(/i). In order to make the algorithm efficiency, we can use truncated 
weight in implementation as done in section 6.1. 

The regularization term X^=i ^j ll/( x — /( x i)ll?2(R P ) is mainly motivated by the recent work of M. 
Belkin and P. Niyogi [45]. In that paper, they have introduced a regularization term Y17t=i ^i,j(/( x i) — 
f(xj)) 2 for semi-supervised regression and classification problems. The term Y^i~j=i ^j(/( x i) ~ /( x j)) 2 
is well-known to be related to graph Laplacian operator. It is used to approximate J X&M || V^i/|| 2 dpx( x ): 
where M. is a compact submanifold which is the support of marginal distribution px(x), and V» is the 
gradient of / defined on A4 [34]. Intuitively, J xeM ||V m f\\ 2 dpx (x) is a smoothness penalty corresponding 
to the probability distribution. The idea behind J xeM || ^ Mf\\ 2 dpx{^) is that it reflects the intrinsic struc- 
ture of px(x). Our regularization term Y^ij=i ^,j'll/( x «) ~ /(x^) ||| 2 ( RP ) is a corresponding vector form of 
J27~j=i Wij(f(xi) — /(xj)) 2 in [45]. The regularization framework of the SGL for semi-supervised learning 
can thus be viewed as a generalization of this previous work. 
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